This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Add first implementation of augmentedNN to predict selectivity #1473

Merged
apavlo merged 11 commits into cmu-db:master from yetiancn:master on Sep 26, 2018

yetiancn
Contributor

@yetiancn yetiancn commented Aug 17, 2018

The model is an initial implementation to predict selectivity for range predicates.
It can be applied to queries like:
SELECT * FROM table WHERE c >= l AND c <= u.
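As a point of reference, the quantity being predicted can be sketched as below. This is a minimal illustration and not code from this PR: `range_selectivity` and `sample_column` are hypothetical names, and selectivity is assumed to mean the fraction of rows that satisfy the range predicate.

```python
import numpy as np


def range_selectivity(column, lower, upper):
    """Fraction of rows with lower <= value <= upper (hypothetical helper)."""
    column = np.asarray(column, dtype=float)
    if column.size == 0:
        return 0.0
    matches = (column >= lower) & (column <= upper)
    return float(matches.mean())


# Ten rows with values 0..9; the predicate c >= 2 AND c <= 5 matches 4 of them.
sample_column = np.arange(10)
selectivity = range_selectivity(sample_column, 2, 5)
```

A learned model would be trained to approximate this value from the bounds alone, without scanning the column at prediction time.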

I implemented the model in augmentedNN.py and the C++ wrapper in augmentedNN.cpp, using LSTM.py and LSTM.cpp as a reference.
The hyperparameters, especially the number of training epochs, still need to be tuned based on experiments on a real system.
Test cases for the model are also included. They cover a dataset with a uniform distribution and a dataset with a skewed distribution.

Two classes are defined:

  1. class AugmentedNN (in augmentedNN.cpp). This class mirrors class TimeSeriesLSTM.
  • Fit(): applies backpropagation.
  • Predict(): returns the predictions for the input.
  • TrainEpoch(): trains for one epoch.
  • ValidateEpoch(): runs one validation epoch.
  2. class TestingAugmentedNNUtil (in testing_forecast_util.cpp)
  • GetData(): generates data for training and testing. The dataset follows either a uniform or a skewed distribution.
  • Test(): calls the APIs above to train and test the model.
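The division of labor between these methods can be sketched with a small stand-in model. Everything here is an assumption for illustration: `TinyRegressor` is a hypothetical NumPy linear regressor, not the TensorFlow-backed AugmentedNN, and the toy target is not a selectivity. It only shows the pattern of a fitting step that updates parameters, a training epoch that wraps it, and a validation epoch that measures loss without updating.

```python
import numpy as np


class TinyRegressor:
    """Hypothetical stand-in: linear regression trained by mini-batch SGD."""

    def __init__(self, num_features, learn_rate=0.1):
        self.weights = np.zeros(num_features)
        self.bias = 0.0
        self.learn_rate = learn_rate

    def predict(self, X):
        return X @ self.weights + self.bias

    def fit(self, X, y, batch_size):
        # One pass over the data, updating parameters after each mini-batch.
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            error = self.predict(xb) - yb
            self.weights -= self.learn_rate * (xb.T @ error) / len(xb)
            self.bias -= self.learn_rate * error.mean()

    def train_epoch(self, X, y, batch_size=16):
        self.fit(X, y, batch_size)
        return float(np.mean((self.predict(X) - y) ** 2))

    def validate_epoch(self, X, y):
        # No parameter update: only measure the loss on held-out data.
        return float(np.mean((self.predict(X) - y) ** 2))


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = X[:, 1] - X[:, 0]  # toy target that is linear in the inputs
X_train, y_train, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

model = TinyRegressor(num_features=2)
first_val_loss = model.validate_epoch(X_val, y_val)
for _ in range(50):
    model.train_epoch(X_train, y_train)
last_val_loss = model.validate_epoch(X_val, y_val)
```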

Btw, in testing_forecast_util.cpp, the argument passed to matrix_eig::bottomRows was wrong: it should be the number of rows counted from the bottom of the matrix. I've fixed it. Please check whether I am right.
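The distinction between a row count and a start index can be shown with a NumPy analogue. This is a sketch under the assumption that matrix_eig is an Eigen matrix type, where bottomRows(n) returns the last n rows. The variable names are hypothetical.

```python
import numpy as np

matrix = np.arange(12).reshape(6, 2)  # 6 rows, 2 columns
num_rows = matrix.shape[0]
split_point = 4                       # the first 4 rows form the top part

top_part = matrix[:split_point]

# "Bottom rows" takes a COUNT of rows measured from the end, not a start index.
bottom_count = num_rows - split_point
bottom_part = matrix[-bottom_count:]

# Passing the split index as the count selects too many rows here.
wrong_bottom = matrix[-split_point:]
```

With the correct count, the top and bottom parts partition the matrix. With the split index used as the count, the two parts overlap.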

@coveralls

coveralls commented Aug 17, 2018

Coverage Status

Coverage decreased (-0.2%) to 76.528% when pulling dc1a075 on yetiancn:master into 1fc8b55 on cmu-db:master.

@yetiancn yetiancn force-pushed the master branch 2 times, most recently from 35692f3 to 8c6ec93 Compare August 18, 2018 00:41
@GustavoAngulo GustavoAngulo self-requested a review August 20, 2018 23:27
@apavlo apavlo requested a review from linmagit August 21, 2018 17:40
Member

@linmagit linmagit left a comment

I went through the code before the tests. The high-level structure looks reasonable to me. I think we still need more comments and documentation to get a better understanding of the code.

Where possible, please answer my questions by adding comments directly to the code. I'll take a second pass, including the tests, once they're addressed.

float learn_rate, int batch_size, int epochs);
/**
* Train the Tensorflow model
* @param mat: Contiguous time-series data
Member

Time-series data? It looks like the documentation here needs to be updated.

using TfFloatIn = TfSessionEntityInput<float>;
using TfFloatOut = TfSessionEntityOutput<float>;

class AugmentedNN : public BaseTFModel {
Member

We need higher-level documentation on what this thing is doing. That is, from a DBMS perspective, what is AugmentedNN modeling? What is it trying to predict? What is the input data, and what is the output data?

* However instead of applying backprop it obtains predicted values.
* Then the validation loss is calculated for the relevant sequence
* - this is a function of segment and horizon.
*/
Member

Same for this function: the comments look outdated.

// Function to generate the args string to feed the python model
std::string ConstructModelArgsString() const;
// Attributes needed for the Seq2Seq LSTM model(set by the user/settings.json)
int ncol_;
Member

Don't use abbreviations in variable names. Instead, use column_number_, or at least column_num_, if n means number here.

// Attributes needed for the Seq2Seq LSTM model(set by the user/settings.json)
int ncol_;
int order_;
int nneuron_;
Member

Same here. And we need comments on what these member variables are.

float ValidateEpoch(const matrix_eig &mat);

void Fit(const matrix_eig &X, const matrix_eig &y, int bsz) override;
matrix_eig Predict(const matrix_eig &X, int bsz) const override;
Member

I think it's better to comment these functions as well. What do they do? What exactly are the matrices in the function arguments? I think the matrix is sometimes the input, sometimes the output, and sometimes both.

self.optimize

@staticmethod
def jumpActivation(k):
Member

We should use the same naming convention. Something like jump_activation() for functions.

def jumpActivation(k):
    def jumpActivationk(x):
        return tf.pow(tf.maximum(0.0, 1-tf.exp(-x)), k)
    return jumpActivationk
Member

After all, what is this jump activation thing with different orders on each layer? Is there any reference? Or can you explain the intuition behind it?
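To make the shape of the quoted expression easier to inspect, here is a NumPy sketch of the same formula, max(0, 1 - exp(-x)) raised to the power k. It uses NumPy in place of TensorFlow and the snake_case name suggested above. Both are choices made for this illustration, and nothing here explains the author's motivation for the function.

```python
import numpy as np


def jump_activation(k):
    """Return f(x) = max(0, 1 - exp(-x)) ** k, the expression quoted above."""
    def jump_activation_k(x):
        x = np.asarray(x, dtype=float)
        return np.power(np.maximum(0.0, 1.0 - np.exp(-x)), k)
    return jump_activation_k


inputs = np.array([-2.0, 0.0, 1.0, 10.0])
order_one = jump_activation(1)(inputs)
order_two = jump_activation(2)(inputs)
```

From the formula alone, the output is zero for x <= 0 and approaches 1 as x grows. For positive k, a larger k keeps the output closer to zero for small positive x.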

@yetiancn yetiancn changed the title add first implementation of augmentedNN to predict selectivity Add first implementation of augmentedNN to predict selectivity Sep 4, 2018
Member

@linmagit linmagit left a comment

Thanks for adding the documentation! I have a few more comments.

I think it's better to create another selectivity folder under the brain directory (for both header and source directory) and put your new stuff there. For the tests, you should put the tests for the new model in a different file, and put TestingAugmentedNNUtil in a file called testing_augmented_nn_util.h(cpp).

Also, I prefer using all lower case augmented_nn in the file names instead of augmentedNN, but that's minor.

size_t split_point =
data.rows() - static_cast<size_t>(data.rows() * val_split);

// Split into train/test data
Member

This is splitting into train/validate data, and the test data is separate, right?

Contributor Author

Right. I'll update the comment here.
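The split discussed here can be sketched as follows. `split_train_validate` is a hypothetical helper that mirrors the split_point arithmetic in the quoted snippet. The array shapes and the separately generated test set are assumptions for illustration.

```python
import numpy as np


def split_train_validate(data, val_split):
    """Hypothetical helper mirroring the quoted split_point arithmetic."""
    num_rows = data.shape[0]
    # int() truncates toward zero, like static_cast<size_t> on a positive value.
    split_point = num_rows - int(num_rows * val_split)
    return data[:split_point], data[split_point:]


rng = np.random.default_rng(0)
train_validate_data = rng.uniform(size=(100, 3))  # used for fitting and validation
test_data = rng.uniform(size=(20, 3))             # generated separately, never split

train_data, validate_data = split_train_validate(train_validate_data, val_split=0.25)
```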

}

TEST_F(ModelTests, DISABLED_AugmentedNNSkewedTest) {
Member

I know we probably cannot enable them in the CI, but have you tested these tests locally?

Contributor Author

Yes, I've tested them.

* SELECT * FROM table WHERE c1 >= l1 AND c1 <= u1
* AND c2 >= l2 AND c2 <= u2
* AND ...
* Input is [l1, u1, l2, u2, ...]
Member

From the tests, it looks like right now we're doing only one pair of predicates, just [l1, u1], right?

Contributor Author

Right.
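The input layout described in the quoted comment can be sketched as a simple flattening of per-column bounds. `encode_range_predicates` is a hypothetical helper, not part of this PR. The single-pair case corresponds to what the tests currently exercise.

```python
def encode_range_predicates(bounds):
    """Flatten [(l1, u1), (l2, u2), ...] into [l1, u1, l2, u2, ...]."""
    features = []
    for lower, upper in bounds:
        if lower > upper:
            raise ValueError("lower bound must not exceed upper bound")
        features.extend([float(lower), float(upper)])
    return features


# One predicate pair: WHERE c1 >= 10 AND c1 <= 20
single_column = encode_range_predicates([(10, 20)])

# Two predicate pairs: ... AND c2 >= 0.5 AND c2 <= 1.5
two_columns = encode_range_predicates([(10, 20), (0.5, 1.5)])
```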

@yetiancn yetiancn force-pushed the master branch 2 times, most recently from f54b380 to a716df1 Compare September 7, 2018 20:24
Member

@linmagit linmagit left a comment

It looks like you have a duplicated Python script. Other than that it looks good to me.

@yetiancn yetiancn force-pushed the master branch 2 times, most recently from 1457242 to 813321d Compare September 17, 2018 02:34
@apavlo apavlo merged commit 6898305 into cmu-db:master Sep 26, 2018
mtunique pushed a commit to mtunique/peloton that referenced this pull request Apr 16, 2019
…b#1473)

* add first implementation of augmentedNN to predict selectivity for range predicates

* add first implementation of augmentedNN to predict selectivity

* add first implementation of augmentedNN to predict selectivity

* add comments and modify variable names

* rename some variables

* create brain/selectivity; create new test file for augmented_nn.

* remove duplicated files

* check if travis is ok